Phang Nga
Can LLMs Help Create Grammar?: Automating Grammar Creation for Endangered Languages with In-Context Learning
Spencer, Piyapath T, Kongborrirak, Nanthipat
Yes! In present-day efforts to document and preserve endangered languages, the application of Large Language Models (LLMs) presents a promising approach. This paper explores how LLMs, particularly through in-context learning, can assist in generating grammatical information for low-resource languages with a limited amount of data. We take Moklen as a case study to evaluate the efficacy of LLMs in producing coherent grammatical rules and lexical entries using only bilingual dictionaries and parallel sentences of the unknown language, without building the model from scratch. Our methodology involves organising the existing linguistic data and prompting the model so that it can efficiently generate formal XLE grammar. Our results demonstrate that LLMs can successfully capture key grammatical structures and lexical information, although challenges such as the potential for English grammatical biases remain. This study highlights the potential of LLMs to enhance language documentation efforts, providing a cost-effective solution for generating linguistic data and contributing to the preservation of endangered languages.
- Asia > Thailand > Bangkok > Bangkok (0.04)
- North America > United States > Pennsylvania (0.04)
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
Multi-Dialect Vietnamese: Task, Dataset, Baseline Models and Challenges
Van Dinh, Nguyen, Dang, Thanh Chi, Nguyen, Luan Thanh, Van Nguyen, Kiet
Vietnamese, a low-resource language, is typically categorized into three primary dialect groups that belong to Northern, Central, and Southern Vietnam. However, each province within these regions exhibits its own distinct pronunciation variations. Despite the existence of various speech recognition datasets, none of them provides a fine-grained classification of the 63 dialects specific to individual provinces of Vietnam. To address this gap, we introduce the Vietnamese Multi-Dialect (ViMD) dataset, a novel comprehensive dataset capturing the rich diversity of 63 provincial dialects spoken across Vietnam. Our dataset comprises 102.56 hours of audio, consisting of approximately 19,000 utterances, and the associated transcripts contain over 1.2 million words. To provide benchmarks and simultaneously demonstrate the challenges of our dataset, we fine-tune state-of-the-art pre-trained models for two downstream tasks: (1) dialect identification and (2) speech recognition. The empirical results suggest two implications: the influence of geographical factors on dialects, and the constraints of current approaches in speech recognition tasks involving multi-dialect speech data. Our dataset is available for research purposes.
- Asia > Vietnam > Hanoi > Hanoi (0.14)
- Asia > Vietnam > Thanh Hóa Province > Thanh Hóa (0.04)
- Asia > Vietnam > Hưng Yên Province > Hưng Yên (0.04)
Information Extraction based on Named Entity for Tourism Corpus
Chantrapornchai, Chantana, Tunsakul, Aphisit
Tourism information is widely scattered nowadays. Searching for it is usually time-consuming: the user must browse through the results from a search engine, then select and view the details of each accommodation. In this paper, we present a methodology to extract particular information from the full text returned by the search engine, so that users can look specifically at the relevant information they want. The approach can be used for the same task in other domains. The main steps are 1) building the training data and 2) building the recognition model. First, the tourism data is gathered and the vocabularies are built. The raw corpus is used to train the vocabulary embedding and to create the annotated data. The process of creating the named entity annotation is presented. Then, the recognition model for a given entity type can be built. In the experiments, given a hotel description, the model can extract the desired entities, i.e., name, location, and facility. The extracted data can further be stored as structured information, e.g., in an ontology format, for future querying and inference. The model for automatic named entity identification, based on machine learning, yields an error ranging from 8% to 25%.
Semantic Search using Spreading Activation based on Ontology
Currently, text document retrieval systems face many challenges in exploring the semantics of queries and documents. Each query implies information that does not appear in the query itself, yet the user also expects documents related to that information. A disadvantage of previous spreading activation algorithms is that they may add many irrelevant concepts to the query. In this paper, the proposed novel algorithm activates and adds to the query only those named entities that are related to the original entities and the explicit relations in the query.
- Europe > United Kingdom (0.06)
- Asia > Southeast Asia (0.05)
- Asia > Thailand > Phuket > Phuket (0.05)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Systems & Languages > Programming Languages (0.61)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.61)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (0.58)
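The constrained spreading activation idea described in the semantic search abstract above can be illustrated with a small sketch. This is not the authors' algorithm, only one possible reading of it: starting from the entities named in a query, activation spreads over an ontology only along relations that the query itself mentions, so unrelated concepts are never added. The toy triples, the function names `build_index` and `expand_query`, and the `max_hops` parameter are all hypothetical and chosen for illustration.

```python
from collections import defaultdict

# Hypothetical toy ontology as (subject, relation, object) triples.
# The data is illustrative only and does not come from the paper.
TRIPLES = [
    ("Phuket", "located_in", "Thailand"),
    ("Phang Nga", "located_in", "Thailand"),
    ("Phuket", "has_attraction", "Patong Beach"),
    ("Phang Nga", "has_attraction", "James Bond Island"),
    ("Thailand", "part_of", "Southeast Asia"),
]


def build_index(triples):
    """Map each entity to its (relation, neighbour) edges in both directions."""
    index = defaultdict(list)
    for subj, rel, obj in triples:
        index[subj].append((rel, obj))
        index[obj].append((rel, subj))
    return index


def expand_query(entities, relations, triples, max_hops=1):
    """Return new entities reachable from the query entities, following
    only relations that are explicitly named in the query (an assumption
    about how the constraint is applied)."""
    index = build_index(triples)
    activated = set(entities)
    frontier = set(entities)
    for _ in range(max_hops):
        next_frontier = set()
        for entity in frontier:
            for rel, neighbour in index[entity]:
                # Spread only along relations the query mentions.
                if rel in relations and neighbour not in activated:
                    next_frontier.add(neighbour)
        activated |= next_frontier
        frontier = next_frontier
    return activated - set(entities)


if __name__ == "__main__":
    # Query mentions the entity "Phuket" and the relation "has_attraction".
    print(sorted(expand_query({"Phuket"}, {"has_attraction"}, TRIPLES)))
```

With an unconstrained spread, the same query would also pull in "Thailand" and, after a second hop, "Phang Nga" and "Southeast Asia"; restricting the spread to relations named in the query is what keeps such concepts out in this sketch.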